Abstract

We consider the El Farol bar problem, also known as the minority game (W. B. Arthur, 
The American Economic Review, 84(2): 406-411 (1994), D. Challet and Y.C. Zhang, Phys- 
ica A, 256:514 (1998)). We view it as an instance of the general problem of how to configure 
the nodal elements of a distributed dynamical system so that they do not "work at cross 
purposes", in that their collective dynamics avoids frustration and thereby achieves a pro- 
vided global goal. We summarize a mathematical theory for such configuration applicable 
when (as in the bar problem) the global goal can be expressed as minimizing a global energy 
function and the nodes can be expressed as minimizers of local free energy functions. We 
show that a system designed with that theory performs nearly optimally for the bar problem. 

1 Introduction 

In many distributed dynamical systems there is little centralized communication and control 
among the individual nodal elements. Despite this handicap, typically we wish to design the 
system so that its dynamical behavior has some desired form. Often the quality of that behavior 
can be expressed as a (potentially path-dependent) global energy function, G. The associated 
design problem is particularly interesting when we can also express the individual nodal elements 
$\eta$ as minimizers of "local" energy functions $\gamma_\eta$. Given G, this reduces the problem to determining
the optimal associated $\{\gamma_\eta\}$.

Because the argument lists of the $\gamma_\eta$ may overlap, what action $\eta$ should take at time t to
minimize $\gamma_\eta$ may depend on what actions the other nodes take at t. Since without binding
contracts $\eta$ cannot know those other actions ahead of time, it cannot assuredly minimize $\gamma_\eta$ in
general. We are particularly interested in cases where each $\eta$ addresses this problem by using
machine-learning (ML) techniques to determine its actions. (In its use of such techniques that
trade off exploration and exploitation, such an $\eta$ often approximates a stochastic node following
the distribution that minimizes $\eta$'s free energy; see below.) In such cases the challenge is to
choose the $\{\gamma_\eta\}$ so that the associated system of good (but suboptimal) ML-based nodes induces
behavior that best minimizes the provided function G.

We refer to a system designed this way, or more generally to a system investigated from 
this perspective, as a Collective INtelligence (COIN) [14]. To agree with bar problem and
game-theory terminology, we refer to the nodes as agents, G as (minus) world utility, and the
$\{\gamma_\eta\}$ as (minus) private utilities. As an example of this terminology, a spin glass in which each
spin $\eta$ is at an energy minimum given the states of the other spins is a "Nash equilibrium" of an
associated "game", a game formed by identifying each agent with a spin $\eta$ and its associated
private utility function with $\eta$'s energy function.

Arthur's bar problem [?] can be viewed as a problem in designing COINs. Loosely speaking,
in this problem at each time t each agent $\eta$ decides whether to attend a bar by predicting, based
on its previous experience, whether the bar will be too crowded to be "rewarding" at that time,
as quantified by a reward function $R_{UD,\eta}$. The greedy nature of the agents frustrates the global
goal of maximizing $G = \sum_\eta R_{UD,\eta}$ at t. This is because if most agents think the attendance will
be low (and therefore choose to attend), the attendance will actually be high, and vice versa.
This frustration effect makes the bar problem particularly relevant to the study of the physics of
emergent behavior in distributed systems [?, 16].

In COIN design we try to avoid such effects by determining new utilities $\{\gamma_\eta\}$ so that all
agents trying to minimize those new utilities means that G is also minimized. (Of course, we
wish to determine the $\{\gamma_\eta\}$ without first explicitly solving for the minimum of G.) As an analogy,
economic systems sometimes have a "tragedy of the commons" (TOC) [10], where each agent's
trying to maximize its utility results in collective behavior that minimizes each agent's utility, and
therefore minimizes $G = \sum_\eta R_{UD,\eta}$. One way the TOC is avoided in real-world
economies is by reconfiguring the agents' utility functions from $\{R_{UD,\eta}\}$ to a set of $\{\gamma_\eta\}$ that
results in better G, for example via punitive legislation like anti-trust regulations. Such utility
modification is exactly the approach used in COIN design.

We recently applied such COIN design to network packet routing [?]. In conventional packet
routing each router uses a myopic shortest path algorithm (SPA), with no concern for side-effects
of its decisions on an external world utility like global throughput (e.g., for whether those decisions
induce bottlenecks). We found that a COIN-based system has significantly better throughput
than does a conventional SPA [?], even when the agents in that system had to predict quantities
(e.g., delays on links) that were directly provided to the SPA.

In this paper we confront frustration effects more directly, in the context of the bar problem. 
In the next section we present (a small portion of) the theory of COINs. Then we present 
experiments applying that theory to the distributed control of the agents in the bar problem. 
Those experiments indicate that by using COIN theory we can avoid the frustration in the bar 
problem and thereby achieve almost perfect minimization of the global energy. 

2 Theory of COINs 

We consider the state of the system across a set of consecutive time steps, $t \in \{0, 1, ...\}$. Without
loss of generality, all relevant characteristics of agent $\eta$ at time t, including its internal
parameters at that time as well as its externally visible actions, are encapsulated by a Euclidean
vector $\zeta_{\eta,t}$, the state of agent $\eta$ at time t. $\zeta_{,t}$ is the set of the states of all agents at t, and $\zeta$ is
the state of all agents across all time.

So world utility is $G(\zeta)$, and when $\eta$ is an ML algorithm "striving to increase" its private
utility, we write that utility as $\gamma_\eta(\zeta)$. The mathematics is generalized beyond such ML-based
agents through an artificial construct: the personal utilities $\{g_\eta(\zeta)\}$. We restrict attention to
utilities of the form $\sum_t R_t(\zeta_{,t})$ for reward functions $R_t$.

We are interested in systems whose dynamics is deterministic. (This covers in particular any
system run on a digital computer.) We indicate that dynamics by writing $\zeta = C(\zeta_{,0})$. So all
characteristics of an agent $\eta$ at t = 0 that affect the ensuing dynamics of the system, including
in particular its private utility if it has one, must be included in $\zeta_{\eta,0}$.

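Definition: A system is factored if for each agent $\eta$ individually,

$$ g_\eta(C(\zeta_{,0})) \ge g_\eta(C(\zeta'_{,0})) \;\Leftrightarrow\; G(C(\zeta_{,0})) \ge G(C(\zeta'_{,0})), \qquad (1) $$

for all pairs $\zeta_{,0}$ and $\zeta'_{,0}$ that differ only for node $\eta$.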

For a factored system, the side effects of a change to $\eta$'s t = 0 state that increases its personal
utility cannot decrease world utility. If the separate agents have high personal utilities, by luck 
or by design, then they have not frustrated each other, as far as G is concerned. 

The definition of factored is carefully crafted. In particular, it does not concern changes in 
the value of the utility of agents other than the one whose state is varied. Nor does it concern 
changes to the states of more than one agent at once. Indeed, consider the following alternative 
desideratum to having the system be factored: any change to £ that simultaneously improves 
all agents' ensuing utilities must also improve world utility. Although it seems quite reasonable, 
there are systems that obey this desideratum and yet quickly evolve to a minimum of world 
utility. For example, any system that has $G(\zeta) = \sum_\eta g_\eta(\zeta)$ obeys this desideratum, and yet as
shown below, such systems entail a TOC in the bar problem.

For a factored system, when every agent's personal utility is optimal, given the other agents'
behavior, world utility is at a critical point. In game-theoretic terms, optimal global behavior
corresponds to the agents' reaching a personal utility Nash equilibrium for such systems [?].
Accordingly, there can be no TOC for a factored system.

As a trivial example, if $g_\eta = G$ $\forall\eta$, then the system is factored, regardless of C. However
there exist other, often preferable sets of $\{g_\eta\}$, as illustrated in the following development.

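Definition: The (t = 0) effect set of node $\eta$ at $\zeta$, $C^{eff}_\eta(\zeta)$, is the set of all components $\zeta_{\eta',t}$
for which $\nabla_{\zeta_{\eta,0}} (C(\zeta_{,0}))_{\eta',t} \ne \vec{0}$. $C^{eff}_\eta$ with no specification of $\zeta$ is defined as
$\cup_{\zeta \in C} C^{eff}_\eta(\zeta)$.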

Definition: Let $\sigma$ be a set of agent-time pairs. $CL_\sigma(\zeta)$ is $\zeta$ modified by "clamping" the
states corresponding to all elements of $\sigma$ to some arbitrary pre-fixed value, here taken to be $\vec{0}$.
The wonderful life utility (WLU) for $\sigma$ at $\zeta$ is defined as:

$$ WLU_\sigma(\zeta) \equiv G(\zeta) - G(CL_\sigma(\zeta)). \qquad (2) $$

In particular, the WLU for the effect set of node $\eta$ is $G(\zeta) - G(CL_{C^{eff}_\eta}(\zeta))$.

$\eta$'s effect set WLU is analogous to the change world utility would undergo had node $\eta$ "never
existed". (Hence the name of this utility; cf. the Frank Capra movie.) However $CL(\cdot)$ is a purely
"fictional", counter-factual mapping, in that it produces a new $\zeta$ without taking into account the
system's dynamics. The sequence of states produced by the clamping operation in the definition
of the WLU need not be consistent with the dynamical laws embodied in C. This is a crucial
strength of effect set WLU. It means that to evaluate that WLU we do not try to infer how the
system would have evolved if node $\eta$'s state were set to $\vec{0}$ at time 0 and the system re-evolved. So
long as we know G and the full $\zeta$, and can accurately estimate what agent-time pairs comprise
$C^{eff}_\eta$, we know the value of $\eta$'s effect set WLU, even if we know nothing of the details of the
dynamics of the system.
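
The clamping construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the array layout (agents × time steps × state dimension), the function names `clamp` and `wonderful_life_utility`, and the toy world utility `toy_G` are all hypothetical and not taken from the paper. The point it shows is that evaluating WLU needs only G, the full state, and the set of agent-time pairs to clamp; no dynamics are re-run.

```python
import numpy as np

def clamp(zeta, sigma, value=0.0):
    """Return a copy of zeta with the (agent, time) entries listed in sigma set to `value`.

    zeta has shape (n_agents, n_times, dim). This is a purely counterfactual
    edit of the state array: the system is not re-evolved.
    """
    clamped = np.array(zeta, dtype=float, copy=True)
    for agent, time in sigma:
        clamped[agent, time, :] = value
    return clamped

def wonderful_life_utility(G, zeta, sigma):
    """WLU_sigma(zeta) = G(zeta) - G(CL_sigma(zeta))."""
    return G(zeta) - G(clamp(zeta, sigma))

# Hypothetical toy world utility (not from the paper): minus the squared
# distance of the summed agent states from a target of 2 at each time step.
def toy_G(zeta):
    totals = zeta.sum(axis=(0, 2))           # one total per time step
    return -float(np.sum((totals - 2.0) ** 2))

zeta = np.ones((3, 2, 1))                    # 3 agents, 2 time steps, scalar states
sigma = [(0, 0), (0, 1)]                     # assumed effect set: agent 0 at both steps
wlu_agent0 = wonderful_life_utility(toy_G, zeta, sigma)
```

In this toy the three agents overshoot the target by one at each step, so removing agent 0 would raise the toy utility and its WLU comes out negative.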

Theorem 1: A COIN is factored if $g_\eta = WLU_{C^{eff}_\eta}$ $\forall\eta$ (proof in [?]).

If our system is factored with respect to personal utilities $\{g_\eta\}$, then we want each $\zeta_{\eta,0}$ to
be a state with as high a value of $g_\eta(C(\zeta_{,0}))$ as possible. Assuming $\eta$ is ML-based and able to
achieve close to the largest possible value of any private utility specified in $\zeta_{\eta,0}$, we would likely
be in such a state of high personal utility if $\eta$'s private utility were set to the associated personal
utility: $\gamma_\eta \equiv \zeta_{\eta,0;\text{private utility}} = g_\eta$. Enforcing this equality, our problem becomes determining
what $\{\gamma_\eta\}$ the agents will best be able to maximize while also causing dynamics that is factored
with respect to the $\{\gamma_\eta\}$.

Now regardless of $C(\cdot)$, both $\gamma_\eta = G$ $\forall\eta$ and $\gamma_\eta = WLU_{C^{eff}_\eta}$ $\forall\eta$ are factored systems (for
$g_\eta = \gamma_\eta$). However since each agent is operating in a large system, it may experience difficulty
discerning the effects of its actions on G when G sensitively depends on all components of the
system. Therefore each $\eta$ may have difficulty learning how to achieve high $\gamma_\eta$ when $\gamma_\eta = G$.
This problem can be obviated using effect set WLU as the private utility since the subtraction 
of the clamped term removes some of the "noise" of the activity of other agents, leaving only the 
underlying "signal" of how the agent in question affects the utility. 

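We can quantify this signal/noise effect by comparing the ramifications on the private
utilities arising from changes to $\zeta_{\eta,0}$ with the ramifications arising from changes to $\zeta_{\hat\eta,0}$, where $\hat\eta$
represents all nodes other than $\eta$. We call this quantification the learnability $\lambda_{\eta,\gamma_\eta}(\zeta)$:

$$ \lambda_{\eta,\gamma_\eta}(\zeta) \equiv \frac{\|\nabla_{\zeta_{\eta,0}} \gamma_\eta(C(\zeta_{,0}))\|}{\|\nabla_{\zeta_{\hat\eta,0}} \gamma_\eta(C(\zeta_{,0}))\|}. \qquad (3) $$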

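Theorem 2: Let $\sigma$ be a set containing $C^{eff}_\eta$. Then

$$ \frac{\lambda_{\eta,WLU_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)} = \frac{\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\|}{\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0})) - \nabla_{\zeta_{\hat\eta,0}} G(CL_\sigma(C(\zeta_{,0})))\|} $$

(proof in [?]).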



This ratio of gradients should be large whenever $\sigma$ is a small part of the system, so that the
clamping won't affect G's dependence on much, and therefore that dependence will approxi- 
mately cancel in the denominator term. In such cases, WLU will be factored just as G is, but far 
more learnable. The experiments presented below illustrate the power of this fact in the context 
of the bar problem, where one can readily approximate effect set WLU and therefore use a utility 
for which the conditions in Thm.'s 1 and 2 should approximately hold. 
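
The form of the learnability ratio in Theorem 2 can be motivated by a short calculation. What follows is only a sketch, not the cited proof. It assumes the gradients exist, writes $\hat\eta$ for all nodes other than $\eta$, and assumes that because $\sigma$ contains $\eta$'s effect set, every component that depends on $\zeta_{\eta,0}$ is clamped, so the clamped term is constant in $\zeta_{\eta,0}$.

```latex
\begin{align*}
\nabla_{\zeta_{\eta,0}} G(CL_\sigma(C(\zeta_{,0}))) &= \vec{0}
  && \text{(assumption: } \sigma \supseteq C^{eff}_\eta\text{)} \\
\lambda_{\eta,WLU_\sigma}(\zeta)
  &= \frac{\|\nabla_{\zeta_{\eta,0}} G(C(\zeta_{,0}))\|}
          {\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))
             - \nabla_{\zeta_{\hat\eta,0}} G(CL_\sigma(C(\zeta_{,0})))\|}
  && \text{(Eq. 3 with } \gamma_\eta = WLU_\sigma\text{)} \\
\lambda_{\eta,G}(\zeta)
  &= \frac{\|\nabla_{\zeta_{\eta,0}} G(C(\zeta_{,0}))\|}
          {\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\|}
  && \text{(Eq. 3 with } \gamma_\eta = G\text{)} \\
\frac{\lambda_{\eta,WLU_\sigma}(\zeta)}{\lambda_{\eta,G}(\zeta)}
  &= \frac{\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))\|}
          {\|\nabla_{\zeta_{\hat\eta,0}} G(C(\zeta_{,0}))
             - \nabla_{\zeta_{\hat\eta,0}} G(CL_\sigma(C(\zeta_{,0})))\|}
\end{align*}
```

The common numerator cancels, which is why the ratio depends only on how much of G's sensitivity to the other nodes survives the subtraction of the clamped term.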



3 Experiments 

We modified Arthur's original problem to be more general, and since we are not interested here in
directly comparing our results to those in [?], we use a more conventional ML algorithm
than the ones investigated in [?], an algorithm that approximately minimizes free
energy. These modifications are similar to those in [?].

There are N agents, each picking one of seven nights to attend a bar the following week, a 
process that is then repeated. In each week, each agent's pick is determined by its predictions of 
the associated rewards it would receive. Each such prediction in turn is based solely upon the 
rewards received by the agent in those preceding weeks in which it made that pick. 

The world utility is $G(\zeta) = \sum_t R_G(\zeta_{,t})$, where $R_G(\zeta_{,t}) \equiv \sum_{k=1}^{7} \phi_k(x_k(\zeta, t))$, $x_k(\zeta, t)$ is the
total attendance on night k at week t, $\phi_k(y) \equiv \alpha_k y \exp(-y/c)$; and c and the $\{\alpha_k\}$ are
real-valued parameters. Intuitively, this G is the sum of the "world rewards" for each night in each
week. Our choice of $\phi_k(\cdot)$ means that when too few agents attend some night in some week, the
bar suffers from lack of activity and therefore the world reward is low. Conversely, when there 
are too many agents the bar is overcrowded and the reward is again low. 
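
As a small worked illustration of this reward shape, the following Python sketch evaluates the single-night reward and the weekly world reward. The function names `phi` and `world_reward` are hypothetical; the formula is the one given above. It checks numerically that, for a night with weight 1 and c = 6, the single-night reward over integer attendances is largest at an attendance of c.

```python
import math

def phi(y, alpha_k, c):
    """Single-night world reward: alpha_k * y * exp(-y / c)."""
    return alpha_k * y * math.exp(-y / c)

def world_reward(attendance, alpha, c):
    """R_G for one week: the sum of phi_k over the nights."""
    return sum(phi(x, a, c) for x, a in zip(attendance, alpha))

c = 6
# Integer attendance that maximizes the single-night reward when alpha_k = 1.
best_attendance = max(range(0, 169), key=lambda y: phi(y, 1.0, c))

# Weekly reward if exactly c agents attended on each of the seven nights.
balanced_week = world_reward([6] * 7, [1.0] * 7, c)
```

An empty night contributes nothing, and a heavily overcrowded night contributes almost nothing because of the exponential factor, which is the behavior described in the text.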
Two different $\alpha$'s are investigated. One treats all nights equally: $\alpha$ = [1 1 1 1 1 1 1]. The
other is only concerned with one night: $\alpha$ = [0 0 0 7 0 0 0]. We set c = 6, and N is 4 times the number
of agents needed to allow c agents to attend the bar on each of the seven nights, i.e., there are
4 × 6 × 7 = 168 agents. For the purposes of the CL operation, an agent's action at time t is
represented as a unary (one-hot) seven-dimensional vector, so the "clamped pick" is (0,0,0,0,0,0,0).

Each $\eta$ has a 7-dimensional vector representing its estimate of the reward it would receive
for attending each night of the week. At the end of each week, the component of this vector
corresponding to the night just attended is proportionally adjusted towards the actual reward
just received. At the beginning of the succeeding week, to trade off exploration and exploitation,
$\eta$ picks the night to attend randomly using a Boltzmann distribution with 7 energies $\epsilon_i(\eta)$ given
by the components of $\eta$'s estimated rewards vector, and with a temperature decaying in time.
This distribution of course minimizes the expected free energy of $\eta$, $E(\epsilon(\eta)) - TS$, or equivalently
maximizes entropy S subject to having expected energy given by T. This learning algorithm is
similar to Claus and Boutilier's independent learner algorithm [?].
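
The learner described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the learning rate, the multiplicative temperature decay, and the stand-in reward are all assumptions made for the example. It also assumes the sign convention that a higher estimated reward makes a night more likely, i.e. the energy is minus the estimate.

```python
import numpy as np

def boltzmann_probabilities(estimates, temperature):
    """P(night i) proportional to exp(estimate_i / temperature)."""
    # Subtracting the maximum leaves the distribution unchanged and avoids overflow.
    scaled = (np.asarray(estimates, dtype=float) - np.max(estimates)) / temperature
    weights = np.exp(scaled)
    return weights / weights.sum()

def pick_night(estimates, temperature, rng):
    """Sample a night index from the Boltzmann distribution."""
    probs = boltzmann_probabilities(estimates, temperature)
    return int(rng.choice(len(probs), p=probs))

def update_estimate(estimates, night, reward, learning_rate):
    """Move the estimate for the attended night proportionally toward the reward."""
    updated = np.array(estimates, dtype=float, copy=True)
    updated[night] += learning_rate * (reward - updated[night])
    return updated

rng = np.random.default_rng(0)
estimates = np.zeros(7)
temperature = 1.0
for week in range(5):
    night = pick_night(estimates, temperature, rng)
    reward = 1.0 if night == 3 else 0.0      # hypothetical stand-in reward
    estimates = update_estimate(estimates, night, reward, learning_rate=0.1)
    temperature *= 0.99                      # hypothetical decay schedule
```

At high temperature the picks are close to uniform (exploration); as the temperature decays the agent concentrates on the night with the highest estimate (exploitation).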

We considered three agent reward functions, using the same learning parameters (learning
rate, Boltzmann temperature, decay rates, etc.) for each. The first reward function had $\gamma_\eta$ =
G $\forall\eta$, i.e., agent $\eta$'s reward function equals $R_G$. The other two reward functions are:

Uniform Division (UD): $R_{UD,\eta}(\zeta_{,t}) \equiv \phi_{d_\eta}(x_{d_\eta}(\zeta, t)) / x_{d_\eta}(\zeta, t)$
Wonderful Life (WL): $R_{WL,\eta}(\zeta_{,t}) \equiv R_G(\zeta_{,t}) - R_G(CL_{(\eta,t)}(\zeta_{,t}))$,

where $d_\eta$ is the night picked by $\eta$. The original version of the bar problem in the physics
literature [?] is a special case where there are two "nights" in the week (one of which corresponds
to "staying at home"); $\alpha$ is uniform; $\phi_k(x_k) = \min_i(x_i)\,\delta_{k,\arg\min_i(x_i)}$; and $R_{UD,\eta}$ is used.

The conventional $R_{UD}$ reward is a "natural" reward function to use; each night's total
reward is uniformly divided among the agents attending that night. In particular, if $g_\eta = \gamma_\eta =
\sum_t R_{UD,\eta}(\zeta_{,t})$, then $G(\zeta) = \sum_\eta g_\eta(\zeta)$, so the "alternative desideratum" discussed above is met. In
contrast, $R_G$ results in the system meeting the desideratum of factoredness. $R_G$ suffers from
poor learnability, at least in comparison to that of $R_{WL}$; by Eq. 3 the ratio of learnabilities is
approximately 11 (see [?] for details). As another point of comparison, to evaluate $R_{WL}$ each
agent only needs to know the total attendance on the night it attended, unlike with $R_G$, which
requires centralized communication concerning all 7 nights.
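
The three single-week rewards, and the ordering condition in the definition of a factored system, can be checked by brute force on a tiny instance. The sketch below is an illustration under stated assumptions: it treats one week in isolation with an agent's pick as its whole state (ignoring the learning dynamics), and the toy sizes (3 agents, 2 nights, c = 1, weights (1, 0)) and all function names are hypothetical choices made so that enumeration is feasible. They are not the paper's settings.

```python
import itertools
import math

TOL = 1e-12  # tolerance so that mathematically equal rewards compare as equal

def phi(y, alpha_k, c):
    return alpha_k * y * math.exp(-y / c)

def attendance(picks, n_nights):
    counts = [0] * n_nights
    for night in picks:
        counts[night] += 1
    return counts

def r_g(picks, alpha, c):
    """World reward for one week."""
    x = attendance(picks, len(alpha))
    return sum(phi(x[k], alpha[k], c) for k in range(len(alpha)))

def r_g_private(agent, picks, alpha, c):
    """Private reward equal to the world reward."""
    return r_g(picks, alpha, c)

def r_ud(agent, picks, alpha, c):
    """Uniform division: the night's reward split among its attendees."""
    x = attendance(picks, len(alpha))
    d = picks[agent]
    return phi(x[d], alpha[d], c) / x[d]

def r_wl(agent, picks, alpha, c):
    """Wonderful life: clamping the agent's one-hot pick to zero removes it from the counts."""
    others = picks[:agent] + picks[agent + 1:]
    return r_g(picks, alpha, c) - r_g(others, alpha, c)

def is_factored(reward, n_agents, alpha, c):
    """Check g(z) >= g(z') <=> G(z) >= G(z') over all z, z' differing in one agent's pick."""
    nights = range(len(alpha))
    for picks in itertools.product(nights, repeat=n_agents):
        for agent in range(n_agents):
            for other_night in nights:
                alt = picks[:agent] + (other_night,) + picks[agent + 1:]
                g_up = reward(agent, picks, alpha, c) >= reward(agent, alt, alpha, c) - TOL
                G_up = r_g(picks, alpha, c) >= r_g(alt, alpha, c) - TOL
                if g_up != G_up:
                    return False
    return True

toy_alpha, toy_c, toy_agents = (1.0, 0.0), 1.0, 3
factored_g = is_factored(r_g_private, toy_agents, toy_alpha, toy_c)
factored_wl = is_factored(r_wl, toy_agents, toy_alpha, toy_c)
factored_ud = is_factored(r_ud, toy_agents, toy_alpha, toy_c)
```

On this toy instance the ordering condition holds for the world-reward and wonderful-life rewards and fails for uniform division: an agent on the zero-weight night gains by joining the already crowded weighted night, while the world reward falls. Note also that `r_wl` reduces to `phi(x_d) - phi(x_d - 1)` for the agent's own night, consistent with the remark that only that night's attendance is needed.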

Finally, in the bar problem the only interaction between any pair of agents is indirect, via
small effects on each other's rewards; each $\eta$'s action at time t has its primary effect on $\eta$'s own
future actions. So the effect set of $\eta$'s entire sequence of actions is well-approximated by $\zeta_{\eta,}$. In
turn, since that sequence is all that is directly affected by the choice of $\eta$'s private utility, the
effect set of $\zeta_{\eta,0;\text{private utility}}$ can be approximated by $\zeta_{\eta,}$, and therefore so can the effect set of
the full $\zeta_{\eta,0}$. Therefore we can approximate the effect set WLU for $\eta$ as $\sum_t R_{WL,\eta}(\zeta_{,t})$.
So we expect that use of $R_{WL,\eta}$ should result in (close to) factored dynamics.

Figure 1 graphs world reward value as a function of time, averaged over 50 runs, for all three
reward functions, for both $\alpha$ = [1 1 1 1 1 1 1] and $\alpha$ = [0 0 0 7 0 0 0]. Performance
with $R_G$ eventually converges to the global optimum. This agrees with the results obtained
by Crites [?] for the bank of elevators control problem. Systems using $R_{WL}$ also converged
to optimal performance. This indicates that in the bar problem $\eta$'s effect set is sufficiently
well-approximated by $\eta$'s future actions so that the conclusions of theorems 1 and 2 hold.

However since the $R_{WL}$ reward has better "signal to noise" than the $R_G$ reward (see
above), convergence with $R_{WL}$ is far quicker than with $R_G$. Indeed, when $\alpha$ = [0 0 0 7 0 0 0],
systems using $R_G$ converge in 1250 weeks, which is 5 times worse than the systems using $R_{WL}$.
When $\alpha$ = [1 1 1 1 1 1 1] systems take 6500 weeks to converge with $R_G$, which is more than 30
times worse than the time with $R_{WL}$. This slow convergence of systems using $R_G$ is a result of
the reward signal being "diluted" by the large number of agents in the system.

Figure 1: Average world reward when $\alpha$ = [0 0 0 7 0 0 0] (left) and when $\alpha$ = [1 1 1 1 1 1 1]
(right). In both plots the top curve is $R_{WL}$, middle is $R_G$, and bottom is $R_{UD}$.

In contrast to the behavior for COIN theory-based reward functions, use of conventional $R_{UD}$
reward results in very poor world reward values that deteriorated with time. This is an instance
of the TOC. For example, when $\alpha$ = [0 0 0 7 0 0 0], it is in every agent's interest to attend the
same night, but their doing so shrinks the world reward "pie" that must be divided among
all agents. A similar TOC occurs when $\alpha$ is uniform. This is illustrated in fig. 2, which shows
a typical example of $\{x_k(\zeta, t)\}$ for each of the three reward functions for t = 2000. In this
example using $R_{WL}$ results in optimal performance, with 6 agents each on 6 separate nights, and
the remaining 132 agents on one night (average world reward of 13.05). In contrast, $R_{UD}$ results
in a uniform distribution of agents and has the lowest average world reward (3.25). Use of $R_G$
results in an intermediate average world reward (6.01).
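
For orientation, the world reward of the two idealized attendance patterns mentioned above can be evaluated directly from the reward formula with uniform weights and c = 6. This is a static calculation under the assumption of exact attendance counts; the values reported in the text are averages from the experiments and need not coincide with these idealized numbers. The function name `world_reward` is hypothetical.

```python
import math

def world_reward(attendance, alpha, c):
    """Weekly world reward: sum over nights of alpha_k * x_k * exp(-x_k / c)."""
    return sum(a * x * math.exp(-x / c) for x, a in zip(attendance, alpha))

alpha = [1.0] * 7
c = 6.0
concentrated = [6, 6, 6, 6, 6, 6, 132]   # 6 agents on six nights, the rest on one night
uniform = [24] * 7                        # 168 agents spread evenly

r_concentrated = world_reward(concentrated, alpha, c)   # about 13.24
r_uniform = world_reward(uniform, alpha, c)             # about 3.08
```

The concentrated pattern sacrifices one night almost entirely in order to keep the other six at the attendance that maximizes their reward, which is why it scores several times higher than the even spread.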
Figure 2: Typical daily attendance when $\alpha$ = [1 1 1 1 1 1 1] for $R_{WL}$, $R_G$, and $R_{UD}$, respectively.

Figure 3: Behavior of each reward function with respect to the number of agents for
$\alpha$ = [0 0 0 7 0 0 0].

Figure 3 shows how performance at t = 2000 scales with N for each reward function for
$\alpha$ = [0 0 0 7 0 0 0]. Systems using $R_{UD}$ perform poorly regardless of N. Systems using $R_G$
perform well when N is low. As N increases however, it becomes increasingly difficult for the
agents to extract the information they need from $R_G$. (This problem is significantly worse for
uniform $\alpha$.) Systems using $R_{WL}$ overcome this learnability problem because $R_{WL}$ is based on
clamping of all agents but one, and therefore is not appreciably affected by N.



4 Conclusion 

The theory of COINs is concerned with distributed systems of controllers in which each controller 
strives to minimize an associated local energy function. That theory suggests how to initialize
and then update those local energy functions so that the resultant global dynamics will achieve 
a global goal. In this paper we present a summary of the part of that theory dealing with how to 
initialize the local energy functions. We present experiments applying that theory to the control 
of individual agents in difficult variants of Arthur's El Farol Bar problem. In those experiments, 
the COINs quickly achieve nearly optimal performance, in contrast to the other systems we 
investigated. This demonstrates that even when the conditions required by the initialization 
theorems of COIN theory do not hold exactly, they often hold well enough so that they can 
be applied with confidence. In particular the COINs automatically avoid the tragedy of the 
commons inherent in the bar problem. 